Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

Posters


View Posters By Category

Session A: (July 22 and July 23)

Session B: (July 24 and July 25)

Presentation Schedule for July 22, 6:00 pm – 8:00 pm

Presentation Schedule for July 23, 6:00 pm – 8:00 pm

Presentation Schedule for July 24, 6:00 pm – 8:00 pm

Session A Poster Set-up and Dismantle
Session A Posters set up: Monday, July 22 between 7:30 am - 10:00 am
Session A Posters should be removed at 8:00 pm, Tuesday, July 23.

Session B Poster Set-up and Dismantle
Session B Posters set up: Wednesday, July 24 between 7:30 am - 10:00 am
Session B Posters should be removed at 2:00 pm, Thursday, July 25.

H-01: Deep Neural Networks Ensemble for Detecting Medication Mentions in Tweets

COSI: Text Mining (Special Session)

Ari Klein, University of Pennsylvania, United States
Karen O'Connor, University of Pennsylvania, United States
Arjun Magge, Arizona State University, United States
Abeed Sarker, University of Pennsylvania, United States
Davy Weissenbacher, University of Pennsylvania, United States
Graciela Gonzalez-Hernandez, University of Pennsylvania, United States

Short Abstract: Twitter posts are now recognized as an important source of patient-generated data, providing unique insights into population health. A fundamental step to incorporating Twitter data in pharmacoepidemiological research is to automatically recognize medication mentions in tweets. Given that lexical searches for medication names may fail due to misspellings or ambiguity with common words, we propose a new method to recognize them. We present Kusuri, a classifier able to identify tweets mentioning drug products and dietary supplements. First, Kusuri applies four different classifiers (lexicon-based, spelling-variant-based, pattern-based and one based on a weakly-trained neural network) in parallel to discover tweets potentially containing medication names. Then, Kusuri classifies the discovered tweets using an ensemble of deep neural networks. On a balanced corpus of 15,005 tweets, Kusuri demonstrated performance close to that of human annotators, with a 93.7% F1-score, the best score achieved on this corpus. On a corpus of all tweets posted by 113 Twitter users (98,959 tweets, of which only 0.26% mention medications), Kusuri obtained a 76.3% F1-score. No prior drug extraction system has been evaluated on such an unbalanced dataset for comparison. The system identifies tweets mentioning drug names with performance high enough to ensure its usefulness once integrated into larger natural language processing systems.

H-02: ClaimMiner: Query-guided Claim Mining in Biomedical Literature

COSI: Text Mining (Special Session)

Xuan Wang, University of Illinois at Urbana-Champaign, United States
Qi Li, University of Illinois at Urbana-Champaign, United States
Jiaxin Huang, University of Illinois at Urbana-Champaign, United States
Yu Zhang, University of Illinois at Urbana-Champaign, United States
Charles Blatti, University of Illinois at Urbana-Champaign, United States
Mikel Hernaez, University of Illinois at Urbana-Champaign, United States
Jiawei Han, BD2K Center of Excellence @ UIUC, United States

Short Abstract: Motivation: Claim mining is a text mining task that automatically extracts literature evidence to support scientific hypothesis validation. Previous work on claim mining assumes that a small set of articles with human-annotated claims is given as training examples. However, selecting a set of human-annotated articles is non-trivial, and the annotation is prone to errors. Results: We propose ClaimMiner, the first query-guided claim mining method for biomedical literature that requires no human-annotated training examples. Given a query, ClaimMiner combines information from the words and entities in the query with textual patterns automatically extracted from massive corpora, extracts the related sentences, and ranks them as claims. Moreover, ClaimMiner allows queries containing general entity types, which has never been explored by previous claim mining methods. ClaimMiner is evaluated on a subset of PubMed and performs well on claim ranking.

H-03: Organizing Bioinformatics GitHub Repositories with Multidimensional Text Cube

COSI: Text Mining (Special Session)

Xuan Wang, University of Illinois at Urbana-Champaign, United States
Qi Li, University of Illinois at Urbana-Champaign, United States
Yu Zhang, University of Illinois at Urbana-Champaign, United States
Xiang Ren, University of Southern California, United States
Jiawei Han, BD2K Center of Excellence @ UIUC, United States

Short Abstract: To promote search and sharing of the vast spectrum of biomedical software tools on GitHub, there is an urgent need to organize GitHub-based biomedical code repositories so that they can be searched flexibly by key attributes (e.g., topic, programming language, and release time). Previous studies rely heavily on human effort to extract such structured information, which is costly, slow and hard to scale. Automatically constructing a biological code repository that facilitates search and sharing thus remains an important challenge with significant implications for open-source tool management. We propose a framework to organize bioinformatics GitHub repositories with a multidimensional text cube. Each dimension of the cube stands for a key attribute. For the topic dimension, we adopt a weakly-supervised classification method to assign an appropriate label to each repository. Unlike approaches limited to text classification, our method incorporates a heterogeneous information network to model the various kinds of information in a biomedical repository. In contrast to fully-supervised methods based on a large training set, our framework requires only a small set of labeled documents (10 for each category) as user guidance. We conduct extensive experiments on a large-scale GitHub repository dataset and observe a clear performance boost over state-of-the-art classification methods.

H-04: Combining databases and text-mining for biological pathway reconstruction

COSI: Text Mining (Special Session)

Salvador Casaní, BioBam Bioinformatics, Spain
Cecile Pereira, Eura Nova, France
Ana Conesa, University of Florida, United States

Short Abstract: Pathway databases are a growing resource, widely used in bioinformatics functional and analytical studies, and manually constructed by expert curators. Although pathway databases are crucial for biological research, it can take several years until new discoveries are incorporated into the established networks. Here, we present the Pathway Extract-and-Extend Automatic Reconstruction (PEAR) pipeline, a Python module that combines text-mining BioNLP resources with curated databases to create ad hoc biological pathways. The entities and relationships extracted from text and databases are associated in a Neo4j graph database, where we store and extract all the relevant information to reconstruct the biological pathway of interest. PEAR uses orthology to facilitate pathway reconstruction in non-model species. PEAR has been validated by reconstructing 14 E. coli pathways, for which a mean of 65% of reactions were reassembled. We also constructed a manually validated human histone acetylation pathway, currently not represented in any of the established pathway databases. Additionally, we show a stress response pathway in Citrus clementina, for which we included orthology relationships with A. thaliana, which improved interpretation of RNA-Seq data on the greening disease caused by Liberibacter asiaticus. PEAR is a flexible framework for studying biological processes that combines structured database information with new, dynamic literature knowledge.

H-05: Discovery of disease- and drug-specific pathways through community structures of a literature network

COSI: Text Mining (Special Session)

Minh Pham, Baylor College of Medicine, United States
Stephen Wilson, Baylor College of Medicine, United States
Harikumar Govindarajan, Baylor College of Medicine, United States
Chih-Hsu Lin, Baylor College of Medicine, United States
Olivier Lichtarge, Baylor College of Medicine, United States

Short Abstract: In response to the exponential growth of scientific publications, text mining is increasingly used to extract biological pathways. Though multiple tools explore individual connections between genes, diseases, and drugs, few extensively examine contextual biological pathways for specific drugs and diseases. We extracted 3,444 functional gene groups for specific diseases and drugs by applying a community detection algorithm to a literature network. The network aggregated co-occurrences of Medical Subject Headings (MeSH) terms for genes, diseases, and drugs in publications. The detected literature communities were groups of highly associated genes, diseases, and drugs. The communities significantly captured genetic knowledge of biological pathways and recovered future pathways in time-stamped experiments. Furthermore, the disease- and drug-specific communities recapitulated known pathways for the given diseases and drugs. In addition, diseases in the same communities had high comorbidity with each other, and drugs in the same communities shared many side effects, suggesting that they shared mechanisms. Indeed, the communities robustly recovered mutual targets for drugs (AUROC = 0.75) and shared pathogenic genes for diseases (AUROC = 0.82). These data show that the literature communities not only represented known biological processes but also suggested novel disease- and drug-specific mechanisms, facilitating disease gene discovery and drug repurposing.

H-06: System development for extracting up-to-date adverse drug effects from MEDLINE

COSI: Text Mining (Special Session)

Yutaro Okano, Tohoku University, Japan
Kengo Kinoshita, Tohoku University, Japan

Short Abstract: Drug therapy is used for treating diseases, but it sometimes causes unexpected adverse drug effects (ADEs) that result in health hazards. Therefore, medical workers should keep their knowledge of ADEs up to date in order to protect their patients. The FDA Adverse Event Reporting System (FAERS) is the largest pharmacovigilance center: medical workers, drug manufacturers, and consumers report ADEs to FAERS, which summarizes those data and makes them publicly available. FAERS provides solid evidence of adverse effects, but medical workers sometimes need more recent information on adverse effects, even if it is only of research quality. For this purpose, MEDLINE is a useful and up-to-date literature database, but it may not be convenient for searching for ADEs because it is designed for researchers across diverse medical fields. In this research, we constructed a system that extracts up-to-date ADEs from publications in the MEDLINE database on a daily basis. With the system, we extracted 7,741 ADEs up to Sep. 9, 2018, and identified 138 additional ADEs from Sep. 10, 2018 to Dec. 12, 2018, which corresponded to 1.95 ADEs per day on average. We will report some details of the system implementation and discuss the future direction of the system.

H-07: Literature Network: Text mining-based extraction of molecular networks in Cytoscape

COSI: Text Mining (Special Session)

Xuan Qin, HZAU, China
Marc Legeay, Novo Nordisk Foundation Center for Protein Research, Denmark
Lars Juhl Jensen, The Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark

Short Abstract: The biomedical literature is a rich but also overwhelming source of information. To help researchers summarize the literature on a topic in the form of a network, we have developed the Literature Network app for Cytoscape, which draws inspiration from two existing apps, namely AgilentLiteratureSearch and stringApp. The user queries PubMed for abstracts describing the topic and selects the organism for which a network should be produced. The app then queries a weekly updated database of precomputed text-mining results through an API to retrieve 1) a list of genes mentioned in the abstracts and 2) the sentences from the abstracts that mention any two of the genes together. From this information, the Literature Network app constructs a Cytoscape network in which the nodes and edges represent the genes and their co-mentions, respectively. The app allows the user to view the sentences supporting any edge, and all the retrieved information is stored within the Cytoscape session file as provenance.
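
The network construction described in this abstract can be illustrated with a minimal sketch. It is not the app's implementation: the function name, the input format (sentence text paired with the genes it mentions) and the toy sentences are all assumptions made for illustration, and the real app retrieves this information through an API that is not shown here.

```python
from collections import defaultdict
from itertools import combinations


def build_comention_network(sentences):
    """Build a co-mention network from (sentence_text, gene_list) pairs.

    Hypothetical input format. Returns (nodes, edges), where nodes is a set
    of gene names and edges maps an alphabetically ordered gene pair to the
    list of sentences that mention both genes.
    """
    nodes = set()
    edges = defaultdict(list)
    for text, genes in sentences:
        unique = sorted(set(genes))  # de-duplicate and fix the pair order
        nodes.update(unique)
        for a, b in combinations(unique, 2):
            edges[(a, b)].append(text)  # keep the sentence as supporting evidence
    return nodes, dict(edges)


# Toy sentences (invented for illustration only).
toy = [
    ("TP53 is regulated by MDM2.", ["TP53", "MDM2"]),
    ("MDM2 and TP53 interact with CDKN1A.", ["MDM2", "TP53", "CDKN1A"]),
    ("BRCA1 is mentioned alone.", ["BRCA1"]),
]
nodes, edges = build_comention_network(toy)
for pair, evidence in sorted(edges.items()):
    print(pair, len(evidence))
```

Keeping the supporting sentences on each edge mirrors the provenance idea in the abstract: every edge can be traced back to the text that produced it.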

H-08: Identification of disease-related genes using GloVe in word embedding

COSI: Text Mining (Special Session)

Giup Jang, Gachon University, South Korea
Youngmi Yoon, Department of Computer Engineering, Gachon University, South Korea

Short Abstract: Identifying disease-related genes is important for understanding disease mechanisms and treating patients. Researchers have proposed various approaches to discover disease-related genes. Wet-lab experiments are time-consuming and costly, so recent efforts have turned to computational methods. In this study, we propose a novel method to identify disease-related genes from a large number of documents using word embedding, a natural language processing technique. Among various word embedding methods, we used GloVe, a word vectorization method that combines the advantages of global matrix factorization methods and local context window methods. From the literature, we collected sentences containing a specific disease and constructed word vectors using GloVe. We assumed that genes that are close to the disease in the vector space are disease-related genes. The similarities between the disease vector and gene vectors were calculated by cosine similarity, and we identified genes with high similarity as disease-related candidates. For validation, we compared our method with existing methods on five diseases (Alzheimer's disease, prostate cancer, gastric cancer, colorectal cancer and lung cancer). We confirmed that the proposed method showed more significant results than previous methods such as frequency-based and PRINCIPLE.
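
The ranking step in this abstract (cosine similarity between a disease vector and gene vectors) might look roughly like the sketch below. The vectors are invented toy values, not GloVe output, and the function and gene names are hypothetical; training the embeddings is outside the scope of this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_genes(disease_vec, gene_vecs, top_k=None):
    """Rank genes by similarity to the disease vector, most similar first."""
    scored = [(gene, cosine_similarity(disease_vec, vec))
              for gene, vec in gene_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy 3-dimensional vectors (assumed values for illustration only).
disease = [1.0, 0.0, 1.0]
genes = {
    "GENE_A": [2.0, 0.0, 2.0],  # same direction as the disease vector
    "GENE_B": [0.0, 1.0, 0.0],  # orthogonal to the disease vector
    "GENE_C": [1.0, 1.0, 0.0],  # partially aligned
}
ranking = rank_genes(disease, genes)
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

Cosine similarity ignores vector length, so a gene whose vector points in the same direction as the disease vector ranks first regardless of magnitude.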

H-09: Multi-Task Learning using Neural Networks for Biomedical Text Mining

COSI: Text Mining (Special Session)

Martin Hofmann-Apitius, Fraunhofer, Germany
Lisa Langnickel, Fraunhofer, Germany
Sumit Madan, Fraunhofer, Germany
Juliane Fluck, ZBMed Information Centre for Life Sciences, Germany

Short Abstract: With the increasing amount of data, automated information extraction becomes more and more important for gaining relevant information and drawing scientific conclusions. The application of neural networks to text mining has been shown to achieve promising results. One of the biggest problems in training neural networks is the limited availability of labeled data. This is especially true for the biomedical field because of its complexity. Therefore, this work focuses on the development of a multi-task workflow for named entity recognition (NER) and subsequent relation extraction (RE) from biomedical literature using neural networks. First, we apply contextual word models created using unsupervised pre-training (with large amounts of textual data). On top of this, specific models for NER and RE are trained using annotated data. In the preliminary phase, we aim to generate a model that predicts microRNA-disease associations. The keyword "miRNA" now yields more than 80,000 publication entries in PubMed. Since miRNAs are promising drug targets, the automatic extraction of miRNA-disease associations is of enormous interest. For the development of the workflow, state-of-the-art methods are adapted, combined and evaluated on this use case.

H-10: Social Media Mining for Studying Patient-Reported Birth Defect Outcomes

COSI: Text Mining (Special Session)

Ari Klein, University of Pennsylvania, United States
Abeed Sarker, University of Pennsylvania, United States
Haitao Cai, University of Pennsylvania, United States
Davy Weissenbacher, University of Pennsylvania, United States
Graciela Gonzalez-Hernandez, University of Pennsylvania, United States

Short Abstract: Birth defects are the leading cause of infant mortality, but methods for studying them remain limited. To assess whether social media data could be used to observe pregnancies with birth defect outcomes, we mined 432 million publicly available posts by 112,647 users who have announced their pregnancy on Twitter. To retrieve tweets that mention birth defects, we developed a rule-based, bootstrapping approach that relies on a lexicon, lexical variants, regular expressions, post-processing, and distributional properties. To identify a cohort for epidemiological analysis, the inclusion criteria were a tweet indicating that the user’s child has a birth defect, and access to the user’s tweets during pregnancy. We manually annotated 16,822 retrieved tweets. We analyzed the Twitter timelines of the 646 users who posted true positive tweets, and identified 195 of them who met the inclusion criteria. Congenital heart defects are the most prevalent birth defects reported on Twitter, consistent with the general population. Social media mining can complement existing methods of birth defects research by providing (1) an opportunity to observe the periconceptional period and the early period of the first trimester, (2) a means of long-term follow-up after birth, (3) internal comparator groups, and (4) an opportunity to explore unknown risk factors.

H-11: An Effective Biomedical Document Classification Scheme in Support of Biocuration: Addressing Class Imbalance

COSI: Text Mining (Special Session)

Xiangying Jiang, University of Delaware, United States
Martin Ringwald, The Jackson Laboratory, United States
Judith Blake, The Jackson Laboratory, United States
Cecilia Arighi, University of Delaware, Computer and Information Sciences Department, United States
Hagit Shatkay, University of Delaware, United States

Short Abstract: The published literature is an important source of information supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. Notably, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential in the context of biomedical document classification. We present here an effective classification scheme for identifying publications containing relevant information for an actual Mouse Genome Informatics curation task involving a large imbalanced dataset. The scheme is based on meta-classification, employing cluster-based under-sampling combined with named-entity recognition (NER) and statistical feature selection strategies. We examined the performance of our method over a large imbalanced dataset, originally generated and curated by the Jackson Laboratory’s Gene Expression Database (GXD), consisting of more than 90,000 PubMed abstracts. Our results, 0.80 recall, 0.72 precision and 0.75 f-measure, demonstrate that our classification scheme effectively categorizes such a large dataset in the face of data imbalance.

H-12: Predicting disease-gene associations using weighted gene network and literature data

COSI: Text Mining (Special Session)

Youngmi Yoon, Department of Computer Engineering, Gachon University, South Korea
Sangwon Shin, Gachon University, South Korea

Short Abstract: Identifying disease-gene associations is important for understanding disease mechanisms. However, identifying all associations by wet-lab experiments is costly. Therefore, computational methods for predicting associations are increasingly used. Among them, text mining extracts meaningful knowledge from unstructured data. We propose a method to predict disease-gene associations using co-occurrence in the literature and the weighted biological network HumanNet. Using biological literature, co-occurrences of genes were obtained from sentences that mention a specific disease. Among the gene pairs from the literature, we use only those that also appear in HumanNet, whose weights represent the probability of gene interactions. For each gene, we calculate the sum of HumanNet weights and the sum of co-occurrences over the gene pairs that include it. Each sum is normalized to a z-score, and the score for each gene is the sum of the two z-scores. The top 10 genes are defined as inferred genes. For validation, inferred genes are compared with known gene-disease associations obtained from public databases to calculate accuracy, and we compare our method with previous methods. Our method shows 90% precision for breast cancer. We confirm that our method outperforms previous methods that use only literature data, since we consider gene pairs from both the literature and HumanNet.
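
The scoring rule in this abstract (per-gene sums, z-score normalization, then addition) can be sketched as follows. This is one reading of the description and not the authors' code: the input tuple format, the function names, the toy values, and the use of the population standard deviation for the z-score are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscores(values):
    """Map each key to its z-score (population standard deviation; an assumption)."""
    mu = mean(values.values())
    sd = pstdev(values.values())
    return {k: (v - mu) / sd if sd > 0 else 0.0 for k, v in values.items()}


def score_genes(pairs, top_n=10):
    """Score genes from (gene_a, gene_b, network_weight, cooccurrence_count) tuples.

    The pairs are assumed to be already restricted to those found in both
    the literature and the network.
    """
    weight_sum = defaultdict(float)
    cooc_sum = defaultdict(float)
    for a, b, weight, cooc in pairs:
        for gene in (a, b):  # each pair contributes to both of its genes
            weight_sum[gene] += weight
            cooc_sum[gene] += cooc
    zw, zc = zscores(weight_sum), zscores(cooc_sum)
    scores = {gene: zw[gene] + zc[gene] for gene in weight_sum}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy gene pairs (invented values for illustration only).
toy_pairs = [
    ("G1", "G2", 3.0, 5),
    ("G1", "G3", 2.0, 4),
    ("G2", "G3", 0.5, 1),
    ("G3", "G4", 0.5, 1),
]
top = score_genes(toy_pairs)
for gene, score in top:
    print(f"{gene}\t{score:+.3f}")
```

Normalizing each sum before adding puts the network weights and the co-occurrence counts on a comparable scale, so neither source dominates simply because of its units.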

H-13: Deep neural multi-task learning improves biomedical concept recognition

COSI: Text Mining (Special Session)

Negacy Hailu, University of Colorado School of Medicine, United States
Lawrence Hunter, UC Denver, United States

Short Abstract: Background: Biomedical concept recognition is the foundation of biomedical information extraction. In our recent work, we demonstrated that a two-stage machine learning system (i.e. span detection followed by normalization) for biomedical concept recognition improved state-of-the-art performance on ten biomedical ontologies. The limitation of such a two-stage system is that each subtask (i.e. span detection and normalization) is treated independently, and hence knowledge of one subtask is not leveraged in performing the other. Inspired by the success of transfer learning and multi-task learning, in this work we demonstrate that jointly learning the span detection and normalization subtasks using a deep neural multi-task learning approach improves the performance of biomedical concept recognition. Results: By jointly learning the span detection and normalization stages, we improve performance on 7 out of 10 biomedical ontologies. Conclusion: Jointly learning the span detection and normalization stages improves the performance of biomedical concept recognition because knowledge of detecting the spans of biomedical concepts is useful for normalization and vice versa. This is an ongoing project, and we anticipate that a single multi-tasking model for all ontologies and for both subtasks will surpass the results reported here, which come from models trained per ontology.

H-14: Opportunities and Problems Related to Novel Trainable Protein Representations

COSI: Text Mining (Special Session)

Serbulent Unsal, Karadeniz Technical University, Turkey
Aybar Can Acar, Middle East Technical University, Turkey
Tunca Dogan, European Bioinformatics Institute, Turkey

Short Abstract: One of the key points for accurately predicting protein features/properties is generating a holistic representation of proteins. Using these representations, inherent features of proteins can be learnt efficiently by a machine learning classifier. There are two main types of representations: fixed and trainable. Fixed representations are based on pre-defined rules designed by human experts, mostly inspired by the natural properties of these molecules. Trainable representations, on the other hand, are data and/or task specific, and are generated based on the patterns found in the data. Lately, trainable representations have been gaining popularity in the life-sciences domain. In this study, we aim to investigate the potential of trainable embeddings for protein function prediction, especially for the prediction of ontological terms with a low number of training instances. For this, we mainly considered novel trainable protein and molecular data representation approaches, such as word2vec/doc2vec-based methods inspired by the NLP field and methods from neural network encoding, all of which have been reported in recent literature to significantly improve over the state-of-the-art in different predictive tasks related to protein science. We classify these representation models according to their technical aspects and their objectives, discuss our results, and propose new directions for protein representation construction.

H-15: Factoid: Making Biological Pathways in Research Articles Easy to Find and Access

COSI: Text Mining (Special Session)

Gary Bader, University of Toronto, Canada
Max Franz, University of Toronto, Canada
Jeffrey Wong, University of Toronto, Canada
Dylan Fong, University of Toronto, Canada
Igor Rodchenkov, University of Toronto, Canada
Funda Durupinar, Oregon Health & Sciences University, United States
Emek Demir, Oregon Health & Sciences University, United States
Augustin Luna, Department of Cell Biology, Harvard Medical School, Boston, MA, United States
Chris Sander, Department of Cell Biology, Harvard Medical School, Boston, MA, United States
Metin Can Siper, Bilkent University, Turkey

Short Abstract: Factoid (factoid.baderlab.org) is a web application that guides authors through the generation of computer-readable records of molecular interactions during manuscript submission. This information may otherwise only be available as part of an article’s text and figures. Factoid aims to be a simple and scalable solution to address obstacles related to pathway data curation while providing up-to-date and accurate interaction data for analytical use. The Factoid project is composed of three components: 1) the Factoid interface, 2) a natural-language processing (NLP) assistant to extract manuscript content, and 3) a storage and retrieval system. The NLP assistant is composed of a Named Entity Recognition (NER) component paired with REACH (Reading and Assembling Contextual and Holistic Mechanisms from Text) for interaction extraction. User sessions involve modification of assistant-generated networks with either a diagrammatic or form-based editor to correctly reflect pathway information from the manuscript. This is followed by the submission of the Factoid for storage. The Factoid will be made available upon acceptance of the author’s manuscript for programmatic retrieval. Presented here are initial results from a pilot project started with journal publishers. Ultimately, Factoid raises awareness about original research articles by helping authors share accurate pathway data with the wider researcher community.

H-16: A benchmark of concept recognition systems

COSI: Text Mining (Special Session)

Fabio Rinaldi, University of Zurich, Switzerland
Lenz Furrer, University of Zurich, Switzerland
Nicola Colic, University of Zurich, Switzerland

Short Abstract: Automatic entity recognition and normalization is fundamental to biomedical text mining. Given biomedical literature’s volume, efficient and scalable tools become essential. This benchmark study compares in terms of speed and coverage four publicly available systems (TaggerOne, Neji, OGER, and Jensen Tagger) built for that task. TaggerOne is a probabilistic system that needs training on an annotated corpus to model the relation between a given controlled vocabulary and the corresponding mentions in text, whereas the other tools use rules for linking surface terms to vocabulary entries (although Neji has an optional probabilistic component for span detection). All systems were run with the task of finding disease mentions in a batch of 30,000 PubMed abstracts. All tools ran on an aptly formatted version of the MEDIC vocabulary; in addition, TaggerOne ran with a bundled model trained on the NCBI disease corpus. The rule-based systems performed 1 to 2 orders of magnitude faster than TaggerOne (approximately 1 minute for serial processing of the batch). Interestingly, faster systems tended to produce more annotations and, in a small manually evaluated sample, more spurious ones. More careful customization to the task at hand, however, is likely to improve the systems’ precision.

H-17: PubTator Central: Automated Concept Annotation of Biomedical Full Text Articles

COSI: Text Mining (Special Session)

Chih-Hsuan Wei, NIH/NLM/NCBI, United States
Alexis Allot, NIH/NLM/NCBI, United States
Robert Leaman, NIH/NLM/NCBI, United States
Zhiyong Lu, NIH/NLM/NCBI, United States

Short Abstract: PubTator Central (https://www.ncbi.nlm.nih.gov/research/pubtator/) is a web service for viewing and retrieving bioconcept annotations in full text biomedical articles. PubTator Central (PTC) provides automated annotations from state-of-the-art text mining systems for genes/proteins, genetic variants, diseases, chemicals, species and cell lines, all available for immediate download. PTC annotates PubMed (29 million abstracts) and the PMC Text Mining subset (3 million full text articles). The new PTC web interface allows users to build full text document collections and visualize concept annotations in each document. Annotations are downloadable in multiple formats (XML, JSON and tab delimited) via the online interface, a RESTful web service and bulk FTP. Improved concept identification systems and a new disambiguation module based on deep learning increase annotation accuracy, and the new server-side architecture is significantly faster. PTC is synchronized with PubMed and PubMed Central, with new articles added daily. The original PubTator service has served annotated abstracts for ~300 million requests, enabling third-party research in use cases such as biocuration support, gene prioritization, genetic disease analysis, and literature-based knowledge discovery. We demonstrate the full text results in PTC significantly increase biomedical concept coverage and anticipate this expansion will both enhance existing downstream applications and enable new use cases.

H-18: Extracting Figures, SubFigures and Captions from Biomedical Publications: Toward Well-Targeted Bio-Curation Support

COSI: Text Mining (Special Session)

Pengyuan Li, University of Delaware, United States
Hagit Shatkay, University of Delaware, United States

Short Abstract: Motivation: Figures and captions convey essential information in biomedical documents. As such, there is a growing interest within bio-curation communities to store and to display biomedical images as evidence for biomedical processes and for experimental results. Notably, the first fundamental step, namely extracting figures and captions from biomedical documents is neither well-studied nor yet well-addressed. Moreover, as the vast majority of published figures are compound images consisting of multiple panels, where each individual panel potentially conveys a different type of information, segmenting such images into constituent panels is another necessary step toward displaying and utilizing published images. Methods: We introduce an effective pipeline comprising two systems: PDFigCapX for identifying and extracting figures and captions from biomedical documents, and FigSplit for splitting the extracted compound figures into their constituent subfigures. Results and Significance: We have tested both systems on existing and on newly assembled datasets. The extensive experimental results demonstrate significant improvement and effectiveness compared to other state-of-the-art methods. Our proposed pipeline thus addresses the essential need for extracting figures, subfigures and captions from biomedical publications, in support of curation visualization needs. The systems PDFigCapX and FigSplit are publicly available for use at: (URLs will be provided in the final version).

H-19: An Informatics Map for Understanding Rare Mitochondrial Disease Symptomology

COSI: Text Mining (Special Session)

Calvin T. Schaffer, The NIH BD2K Center of Excellence in Biomedical Computing, University of California, Los Angeles, United States
Jaewoo Kim, The NIH BD2K Center of Excellence in Biomedical Computing, University of California, Los Angeles, United States
Anders Olav Garlid, The NIH BD2K Center of Excellence in Biomedical Computing, University of California, Los Angeles, United States
Vladimir Guevara-Gonzalez, The NIH BD2K Center of Excellence in Biomedical Computing, University of California, Los Angeles, United States
Hirsh Bhatt, The NIH BD2K Center of Excellence in Biomedical Computing, University of California, Los Angeles, United States
Harry Caufield, University of California, Los Angeles, United States

Short Abstract: Mitochondrial diseases, many of them rare, are recognized globally as vicious killers. Our limited understanding of these diseases and their pathogenesis leads to delayed diagnoses and a dearth of treatment options, compounded by the fragmented nature of clinical case information and unstructured text data. To impose structure on clinical information relating to rare mitochondrial diseases (RMDs), we created a metadata template for clinical case reports (CCRs) and aggregated over 395 reports on 7 RMDs, including deficiencies in complex I through V of the electron transport chain, carnitine deficiency, and Barth syndrome. We constructed a digital map using standardized ICD-10 codes for a systematic understanding of symptomology among these diseases and established the MitoCases platform (www.mitocases.org) to provide a FAIR (Findable, Accessible, Interoperable, and Reusable) data resource. Among 52 CCRs on 111 patients with Barth syndrome, we extracted 1,051 instances of 211 unique ICD-10 codes, along with detailed metadata. The landscape of this digital map highlights shared and common symptoms as well as rare and unique characteristics of these diseases, revealing pathogenesis and mechanistic insights underlying RMDs. This standardization and integration with existing ontologies render metadata FAIR and enable the biomedical community to elevate understanding and improve patient care.
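
The distinction the H-19 abstract draws between code instances, unique codes, and shared versus rare symptoms can be illustrated with a small tally. This is only a sketch under stated assumptions: the report IDs and the code assignments below are toy values invented for the example, not data from MitoCases, and the function name is hypothetical.

```python
from collections import Counter

# Toy records invented for this example (not MitoCases data):
# report ID -> ICD-10 codes annotated in that report.
reports = {
    "report_1": ["I42.0", "D70", "R62.8"],
    "report_2": ["I42.0", "D70"],
    "report_3": ["I42.0", "E71.1"],
}


def summarize_codes(reports):
    """Tally code instances, unique codes, and how widely each code is shared."""
    # Every annotated occurrence counts as one instance.
    instances = Counter(code for codes in reports.values() for code in codes)
    # Number of distinct reports in which each code appears.
    report_freq = Counter(
        code for codes in reports.values() for code in set(codes)
    )
    return {
        "instances": sum(instances.values()),
        "unique_codes": len(instances),
        "shared": sorted(c for c, n in report_freq.items() if n > 1),
        "single_report": sorted(c for c, n in report_freq.items() if n == 1),
    }


summary = summarize_codes(reports)
```

Codes listed under "shared" appear in more than one report, while those under "single_report" are candidates for rare or unique characteristics.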

H-20: LexMapr: a rule-based text mining tool for ontology-driven harmonization of short biomedical specimen descriptions

COSI: Text Mining (Special Session)

Gurinder Pal Gosal, University of British Columbia, Vancouver, Canada, Canada
Emma Griffiths, University of British Columbia, Vancouver, Canada, Canada
Damion Dooley, University of British Columbia, Vancouver, Canada, Canada
Ivan Gill, University of British Columbia, Vancouver, Canada, Canada
Dan Fornika, BC Centre for Disease Control, Canada, Canada
Heather Tate, US Food & Drug Administration, USA, United States
Maria Sanchez, US Food & Drug Administration, USA, United States
Ruth Timme, US Food & Drug Administration, USA, United States
William Hsiao, University of British Columbia, Vancouver, Canada, Canada

Short Abstract: Pathogen sample metadata contain important contextual information for the interpretation of whole genome sequencing-based analyses used for public health responses. These data are often encoded as inconsistent free text and require time-consuming and error-prone clean-up before investigators can aggregate data for analyses. LexMapr is an open-source, ontology-driven, rule-based text-mining system developed to harmonize short phrase sample descriptions. LexMapr combines basic lexicographic transformation with light Natural Language Processing and other functionality to standardize text to ontology terms, followed by mapping to different classification schemes. LexMapr performance has been tested for its public health utility on foodborne pathogen sample data from two different surveillance systems: the US FDA’s GenomeTrakr system and the US National Antimicrobial Resistance Monitoring System’s Resistome Tracker platform. LexMapr ontology term coverage, based on >2000 unique samples, was assessed at 89% (accuracy 95%) based on strict criteria. LexMapr is currently available as a locally installable command-line tool (https://anaconda.org/bioconda/lexmapr), and a user-friendly GUI is under development. These results indicate that LexMapr can help increase data interoperability, reusability, and computability for different public health analytical systems. We foresee the use of LexMapr for other content domains by adding selected domain-specific ontologies and rules.
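
As a rough illustration of the kind of rule-based harmonization the H-20 abstract describes, the Python sketch below cleans short free-text sample descriptions and looks them up in a small term table. It is not LexMapr's code or API: the function names, the cleanup rules, and the placeholder identifiers (EX:0001 and so on) are all assumptions invented for this example.

```python
import re

# Hypothetical lookup table: normalized label -> placeholder term ID.
# The IDs are invented for illustration and belong to no real ontology.
TERMS = {
    "chicken breast": "EX:0001",
    "ground beef": "EX:0002",
    "lettuce": "EX:0003",
}

# Hypothetical abbreviation rules, applied token by token.
SYNONYMS = {"chkn": "chicken", "grnd": "ground"}


def singularize(token):
    """Very naive plural stripping, e.g. 'breasts' -> 'breast'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(description):
    """Lowercase, replace punctuation with spaces, expand abbreviations, singularize."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", description.lower())
    tokens = [SYNONYMS.get(t, t) for t in cleaned.split()]
    return " ".join(singularize(t) for t in tokens)


def map_to_term(description):
    """Return (normalized text, term ID), or (normalized text, None) if unmatched."""
    norm = normalize(description)
    if norm in TERMS:
        return norm, TERMS[norm]
    # Fall back to the longest known label contained in the text as whole words.
    padded = f" {norm} "
    for label in sorted(TERMS, key=len, reverse=True):
        if f" {label} " in padded:
            return norm, TERMS[label]
    return norm, None
```

For example, `map_to_term("Chkn Breasts!")` resolves to the "chicken breast" entry, while a description with no known label returns `None` and would be left for manual review.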

H-21: A visual analytics tool to enhance pathway models with literature-based evidence and confidence: LitPathExplorer

COSI: Text Mining (Special Session)

Chrysoula Zerva, The University of Manchester, United Kingdom
Axel Soto, Institute for Computer Science and Engineering, CONICET–UNS, Argentina, Argentina
Sophia Ananiadou, The University of Manchester, United Kingdom

Short Abstract: Biomedical interaction networks and pathway models are invaluable resources for understanding and experimenting with the mechanisms underpinning complex biological processes. So far, their curation, maintenance and updating have been carried out mostly manually, involving thorough inspection of related scientific literature and frequent updates. Navigating through and discovering interactions among the rapidly increasing amounts of scientific literature is a complex and time-consuming process, which could be aided by computational methods. Text mining and citation analysis have proven to be valuable towards this goal, enabling automated identification of biomolecular interactions in text and linking them to related pathways. Interpretable integration of such methods into pathway tools could expedite pathway curation and related research. We address these challenges by proposing LitPathExplorer, which combines advanced text mining methods and interactive visualisation functionalities. LitPathExplorer mines large document collections to identify corroborating evidence for existing pathway interactions and to propose new interactions. Contextual and bibliometric information is used to complement each interaction with confidence metrics. We present use-cases from two cancer research areas, where LitPathExplorer was well received by researchers. We also discuss performance in terms of precision and demonstrate how citation analysis can further improve precision in identifying the relevance of new text-mined interactions to a pathway.

H-22: A text-mined integrated knowledge map for MicroRNAs

COSI: Text Mining (Special Session)

Debarati Roychowdhury, University of Delaware, United States
Cecilia Arighi, University of Delaware, Computer and Information Sciences Department, United States
K. Vijay-Shanker, University of Delaware, United States
Samir Gupta, University of Delaware, United States

Short Abstract: Motivation: miRNAs are essential gene regulators and their dysregulation often leads to disease. Easy access to miRNA information is crucial for exploiting existing knowledge with the aims of designing new experiments, interpreting generated experimental data, connecting facts across publications and generating new hypotheses that build on previous knowledge. Here, we present an integrative text mining approach to collect miRNA information from the literature. Results: We collected 100,000 miRNA-PubMed ID pairs from Medline. We used a set of existing publicly available and in-house developed tools to extract bioentities and their relations with miRNAs. The entity pairs include: miRNA-gene to detect miRNA-gene regulation (52,105 relations); miRNA-disease (51,569 relations) and differential expression level-miRNA to capture the differential expression of a miRNA in the context of disease vs. normal/disease stage (7,259 relations); miRNA-biological process (35,273 relations); circulatory miRNAs to capture potential biomarkers (6,599 relations); and miRNA-tissues and organs to provide biological context. Bioentities were normalized to facilitate querying and integration. We built a database and an interface to store and access the integrated data, respectively. Conclusion: We will demonstrate that our resource can assist in answering relevant biological questions by evaluating miRNA-associated signatures in glioblastoma multiforme, the most aggressive form of primary brain tumor.

H-23: OntoMate: a text-mining tool to facilitate curation at the Rat Genome Database

COSI: Text Mining (Special Session)

Monika Tutaj, Medical College of Wisconsin, United States
G. Thomas Hayman, Medical College of Wisconsin, United States
Shur-Jen Wang, Medical College of Wisconsin, United States
Jeffrey L. De Pons, Medical College of Wisconsin, United States
Jyothi Thota, Medical College of Wisconsin, United States
Jennifer R. Smith, Rat Genome Database, Medical College of Wisconsin, United States
Matthew Hoffman, Medical College of Wisconsin, United States
Stanley J.F. Laulederkind, Medical College of Wisconsin, United States
Harika Srividya Nalabolu, Medical College of Wisconsin, United States
Marek Tutaj, Medical College of Wisconsin, United States
Melinda Dwinell, Medical College of Wisconsin, United States
Mary Shimoyama, Medical College of Wisconsin, United States

Short Abstract: The Rat Genome Database (RGD, https://rgd.mcw.edu) is the premier online repository of rat genomic, genetic and physiologic data and is being further developed as a cross-species platform for translational research. Converting data from free text in the scientific literature to a structured format is one of the main tasks of all model organism databases. To aid curators in the task of identifying curatable papers and the relevant data therein, RGD has developed an ontology-driven custom text mining software tool called OntoMate, which is tightly integrated with RGD's curation software. OntoMate's backend tools analyze the text and extract information from articles in order to enrich processed articles with semantic tags, including gene symbols, mutations, species, and terms from eleven ontologies used for curation. Named Entity Recognition for biocuration was implemented using plugins provided by bioNLP frameworks. Results are stored in a Hadoop NoSQL database which can be scaled horizontally. Tagged abstracts matching curator input are presented in a UI that provides user-activated filters and an integrated ontology browser to expedite the curation process. Proposed future enhancements include the ability to recognize tables containing quantitative phenotype data for RGD's PhenoMiner curation in full text articles and, if possible, supplementary materials.

H-24: Learning Structured Knowledge from Clinical Case Reports

COSI: Text Mining (Special Session)

Yichao Zhou, University of California, Los Angeles, United States
Harry Caufield, University of California, Los Angeles, United States
Yizhou Sun, University of California, Los Angeles, United States
Peipei Ping, University of California, Los Angeles, United States
David Liem, BD2K Center of Excellence @ UCLA, United States
Dibakar Sigdel, BD2K Center of Excellence @ UCLA, United States
Kai-Wei Chang, University of California, Los Angeles, United States
Wei Wang, University of California, Los Angeles, United States

Short Abstract: Unstructured data constitute a unique, rapidly expanding type of biomedical data and a treasure trove of undiscovered insights. Much of these data are within published manuscripts: PubMed currently indexes well over 29 million documents, including more than 2 million clinical case reports (CCRs). The giant volume of these text data, their variable structure, and their heterogeneous subdomains collectively present a herculean challenge for the biomedical research community in parsing them for discrete biomedical relationships. The abilities of human readers are amplified by tools to index and discern content within unstructured biomedical text, though scalable methods do not yet exist, requiring automated approaches to be built upon extensive manual annotation and curation. In this talk, we will present our latest research on Clinical Report Extraction and Annotation Technology (CREATe). This approach includes new natural language processing and machine learning models and algorithms capable of accurately recognizing entities corresponding to concepts and events in CCRs, determining their optimal types and relationships using distantly-supervised learning guided by existing ontologies and taxonomies, and minimizing human annotation. Our goals are to extract, organize, and learn from biomedical concepts within unstructured text, and translate them into a unified knowledge representation supporting efficient inference, integration, and interpretation.

H-25: Text Mining for Biomedical Literature-Based Discovery

COSI: Text Mining (Special Session)

Xuan Wang, University of Illinois at Urbana-Champaign, United States
Qi Li, University of Illinois at Urbana-Champaign, United States
Yu Zhang, University of Illinois at Urbana-Champaign, United States
Jiaming Shen, University of Illinois at Urbana-Champaign, United States
Jinfeng Xiao, University of Illinois at Urbana-Champaign, United States
Jiawei Han, BD2K Center of Excellence @ UIUC, United States

Short Abstract: Text mining will play a critical role in harnessing biomedical big data. Our recent research has generated several innovative systems that will enable biomedical discoveries by exploring massive unlabeled biotext data with minimal human/expert annotation or labeling effort. Such a weakly/distantly supervised approach leads to new principles, methods, implementations, and applications for scalable biotext mining, with three systems developed: (1) SetSearch+, (2) AutoBioNER, and (3) ClaimMiner. SetSearch+ is a novel biomedical literature search system. It intelligently handles typed entities and their relations, and thus retrieves more relevant documents for complex biomedical queries containing multiple entities. AutoBioNER is a system that automatically mines and types biomedical entities from text with distant supervision from user-input dictionaries. It achieves strong performance on BioNER benchmark datasets and on new types of entities without existing training data. ClaimMiner is a query-guided claim mining system for biomedical literature. Given a query containing concrete biological entities or general entity types (e.g., $Chemical, $Gene), ClaimMiner automatically extracts and ranks claim sentences from massive corpora as literature evidence to support scientific hypothesis validation. We will show case studies and videos of our three systems, which we expect could also be useful for mining biological text beyond the biomedical literature.

H-26: Differential Diagnosis Through Knowledge-Graph-Powered NLP

COSI: Text Mining (Special Session)

Linda Wogulis, Elsevier, United States
Craig Stanley, Elsevier, United States
Danielle Walsh, Elsevier, United States
Will Dowling, Elsevier, United States

Short Abstract: It is estimated that misdiagnosis affects >12M patients annually and may be responsible for as many as 100,000 deaths; at the same time, the biomedical knowledge available through PubMed continues to grow at an average of >1,300 new articles per day. In this work, we hypothesized that the gap between correct diagnosis at the point of care and application of medical literature knowledge can be reduced by automating symptom extraction from relevant literature using machine learning methods, while preserving the certainty score of each symptom/diagnosis assertion by means of a knowledge graph. In this session we present the results of our differential diagnosis (DDx) method and evaluate them against the Isabel Healthcare DDx tool.

H-27: Automated recognition of functional compound-protein relationships in literature

COSI: Text Mining (Special Session)

Ammar Qaseem, Albert-Ludwigs-University Freiburg, Germany
Kersten Döring, Albert-Ludwigs-University Freiburg, Germany
Kiran K Telukunta, Albert-Ludwigs-University Freiburg, Germany
Michael Becer, Albert-Ludwigs-University Freiburg, Germany
Philippe Thomas, DFKI Language Technology Lab, Berlin, Germany
Stefan Günther, Albert-Ludwigs-University Freiburg, Germany

Short Abstract: Motivation: Much effort has been invested in the identification of protein-protein interactions using text mining and machine learning methods. The extraction of functional relationships between chemical compounds and proteins from literature has received much less attention, and no ready-to-use open-source software is so far available for this task. Method: We created a new benchmark dataset of 2,753 sentences from abstracts containing annotations of proteins, small molecules, and their relationships. Two kernel methods, the shallow linguistic kernel and the all-paths graph kernel, were applied to classify these relationships as functional or non-functional. Furthermore, the benefit of interaction verbs in sentences was evaluated. Results: The cross-validation of the all-paths graph kernel (AUC value: 84%, F1 score: 81%) shows slightly better results than the shallow linguistic kernel (AUC value: 81%, F1 score: 79%) on our benchmark dataset. Both models achieve state-of-the-art performance in the research area of relation extraction. Furthermore, the combination of the shallow linguistic and all-paths graph kernels could slightly increase the overall performance. We used each of the two kernels to identify functional relationships in all PubMed abstracts (28 million) and provide the results, including recorded processing time.